This project explores factors influencing building energy consumption in Southern California. Energy efficiency is a critical concern in modern urban settings, particularly in regions with high electricity demands and varying environmental conditions. By analyzing energy consumption data, we aim to identify patterns and factors that drive electricity usage, offering insights for energy optimization.
Source: Kaggle
Description:
The dataset contains hourly electricity usage data for residential,
commercial, and industrial buildings in Southern California, spanning
from January 2018 to January 2024. It includes over 100 facilities and
integrates information from smart meters, IoT sensors, and utility
companies. Key metrics include electricity usage, weather conditions,
and building characteristics, making it suitable for time-series
analysis, energy forecasting, and studying energy efficiency.
Dataset Summary:
library(ggplot2)
library(dplyr)
data = read.csv("electricity_consumption_optimization_dataset.csv")
str(data)
## 'data.frame': 52585 obs. of 40 variables:
## $ Timestamp : chr "1/01/2018 0:00" "1/01/2018 1:00" "1/01/2018 2:00" "1/01/2018 3:00" ...
## $ Building.Type : chr "Residential" "Industrial" "Commercial" "Residential" ...
## $ Energy.Consumption..kWh. : num 74.7 46.6 58.8 53.6 37.8 ...
## $ Temperature : num 31.4 30.2 19.2 16.7 29.6 ...
## $ Humidity.... : num 62.5 63.1 65 67.4 55.1 ...
## $ Occupancy.Rate.... : num 49.3 65 -16.6 27.4 74.2 ...
## $ Lighting.Consumption..kWh. : num 9.892 11.064 0.582 3.58 17.824 ...
## $ HVAC.Consumption..kWh. : num 9.07 26.49 10.39 8.2 12.27 ...
## $ Energy.Price....kWh. : num 0.0533 0.019 0.0603 0.2093 0.2253 ...
## $ Carbon.Emission.Rate..g.CO2.kWh. : num 342 427 278 692 621 ...
## $ Power.Factor : num 0.932 0.984 0.801 0.921 0.902 ...
## $ Voltage.Levels..V. : num 238 221 238 225 229 ...
## $ Reactive.Power..kVARh. : num 7.98 6.99 3.83 6.6 4.77 ...
## $ Power.Outage.Indicator : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Indoor.Temperature...C. : num 25 19.6 17.5 20.3 24.8 ...
## $ Building.Age..years. : num 31.8 16.5 34.9 25.3 30 ...
## $ Equipment.Age..years. : num 15.36 5.33 7.25 8.24 13.85 ...
## $ Energy.Efficiency.Rating : num 50 53.6 64.1 59.2 82.3 ...
## $ Building.Size.m.2. : num 545.8 1308.2 -47.6 348.6 1240.7 ...
## $ Window.to.Wall.Ratio.... : num 32.3 37.1 22.5 30.2 43.7 ...
## $ Insulation.Quality.Score : num 9.88 2.39 6.86 9.46 10.24 ...
## $ Historical.Energy.Consumption..kWh.: num 81.7 82.3 49.6 3.1 66.1 ...
## $ Maintenance.Status : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Demand.Response.Participation : int 1 0 0 1 0 0 1 0 0 0 ...
## $ Occupancy.Schedule : chr "Occupied" "Occupied" "Vacant" "Occupied" ...
## $ Local.Energy.Production..kWh. : num 6.44 6.45 7.23 11.4 9.9 ...
## $ Grid.Stability.Score : num 80.7 91.5 72.9 91.3 78.2 ...
## $ Solar.Irradiance : num 159 295 135 240 325 ...
## $ Smart.Plug.Usage..kWh. : num 0.3185 0.0998 0.0192 0.3154 0.2355 ...
## $ Water.Usage..liters. : num 102.11 178.5 104.65 8.94 99.79 ...
## $ Energy.Savings.Target.... : num 16.2 16.1 17.5 14.1 21 ...
## $ Room.Level.Energy.Consumption..kWh.: num 13.3 11.8 24.3 24.6 24.5 ...
## $ Zonal.Heating.Cooling.Data..kWh. : num 6.72 7.04 12.87 9.53 10.04 ...
## $ Electric.Vehicle.Charging.Status : int 0 1 0 0 0 0 0 1 0 0 ...
## $ Building.Orientation : chr "South" "South" "North" "South" ...
## $ IoT.Sensor.Count : num 21.4 34.4 67.6 37.7 36.3 ...
## $ Thermal.Comfort.Index : num 80.8 79.7 84.6 96 68.2 ...
## $ Energy.Savings.Potential.... : num 14.12 4.11 4.13 8.68 19.63 ...
## $ Peak.Demand.Reduction.Indicator : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Carbon.Emission.Reduction.Category : chr "Moderate Reduction" "Moderate Reduction" "Moderate Reduction" "No Reduction" ...
This analysis aims to:
In an era of increasing focus on climate change and energy efficiency, understanding the factors that influence building energy consumption is crucial for sustainable urban development. This dataset offers a valuable opportunity to uncover patterns and develop strategies for optimizing energy usage in Southern California, contributing to more sustainable and efficient energy management practices.
The raw dataset was first loaded and cleaned to prepare it for analysis. The following steps were performed:
- The Timestamp column was converted to a datetime format.
- Numeric columns (Energy Consumption, Solar Irradiance and so on) were checked for invalid negative values, which were removed.
Next, we acquire the data and then perform the data cleaning.
df = read.csv("electricity_consumption_optimization_dataset.csv")
head(df)
## Timestamp Building.Type Energy.Consumption..kWh. Temperature
## 1 1/01/2018 0:00 Residential 74.68 31.36
## 2 1/01/2018 1:00 Industrial 46.59 30.23
## 3 1/01/2018 2:00 Commercial 58.84 19.18
## 4 1/01/2018 3:00 Residential 53.59 16.70
## 5 1/01/2018 4:00 Residential 37.80 29.62
## 6 1/01/2018 5:00 Residential 62.58 27.28
## Humidity.... Occupancy.Rate.... Lighting.Consumption..kWh.
## 1 62.47 49.29 9.8921
## 2 63.07 65.04 11.0637
## 3 65.03 -16.60 0.5823
## 4 67.41 27.40 3.5800
## 5 55.07 74.22 17.8236
## 6 73.05 68.00 10.2164
## HVAC.Consumption..kWh. Energy.Price....kWh. Carbon.Emission.Rate..g.CO2.kWh.
## 1 9.073 0.05330 341.8
## 2 26.488 0.01903 427.3
## 3 10.386 0.06028 278.1
## 4 8.200 0.20932 691.9
## 5 12.268 0.22533 620.5
## 6 11.016 0.21826 631.3
## Power.Factor Voltage.Levels..V. Reactive.Power..kVARh. Power.Outage.Indicator
## 1 0.9321 237.5 7.982 0
## 2 0.9842 221.2 6.990 0
## 3 0.8007 237.6 3.826 0
## 4 0.9213 224.6 6.601 0
## 5 0.9016 229.2 4.767 0
## 6 0.8723 232.4 6.613 0
## Indoor.Temperature...C. Building.Age..years. Equipment.Age..years.
## 1 24.998 31.77 15.357
## 2 19.593 16.46 5.326
## 3 17.459 34.93 7.254
## 4 20.344 25.27 8.244
## 5 24.778 30.00 13.851
## 6 9.747 33.51 21.895
## Energy.Efficiency.Rating Building.Size.m.2. Window.to.Wall.Ratio....
## 1 50.02 545.82 32.26
## 2 53.58 1308.15 37.11
## 3 64.05 -47.62 22.45
## 4 59.22 348.65 30.16
## 5 82.28 1240.68 43.70
## 6 65.58 1336.67 35.58
## Insulation.Quality.Score Historical.Energy.Consumption..kWh.
## 1 9.877 81.70
## 2 2.385 82.27
## 3 6.860 49.56
## 4 9.456 3.10
## 5 10.239 66.12
## 6 7.481 37.22
## Maintenance.Status Demand.Response.Participation Occupancy.Schedule
## 1 0 1 Occupied
## 2 0 0 Occupied
## 3 0 0 Vacant
## 4 0 1 Occupied
## 5 0 0 Occupied
## 6 0 0 Occupied
## Local.Energy.Production..kWh. Grid.Stability.Score Solar.Irradiance
## 1 6.437 80.67 159.0
## 2 6.454 91.47 294.8
## 3 7.226 72.85 134.9
## 4 11.395 91.34 239.7
## 5 9.901 78.19 325.3
## 6 6.724 75.82 145.2
## Smart.Plug.Usage..kWh. Water.Usage..liters. Energy.Savings.Target....
## 1 0.31852 102.114 16.21
## 2 0.09984 178.497 16.08
## 3 0.01916 104.648 17.46
## 4 0.31539 8.939 14.13
## 5 0.23552 99.788 20.96
## 6 0.12821 125.938 15.68
## Room.Level.Energy.Consumption..kWh. Zonal.Heating.Cooling.Data..kWh.
## 1 13.34 6.720
## 2 11.75 7.041
## 3 24.30 12.874
## 4 24.59 9.527
## 5 24.48 10.040
## 6 33.10 21.046
## Electric.Vehicle.Charging.Status Building.Orientation IoT.Sensor.Count
## 1 0 South 21.43
## 2 1 South 34.39
## 3 0 North 67.59
## 4 0 South 37.71
## 5 0 North 36.32
## 6 0 East 58.87
## Thermal.Comfort.Index Energy.Savings.Potential....
## 1 80.81 14.115
## 2 79.68 4.108
## 3 84.57 4.131
## 4 95.95 8.682
## 5 68.23 19.632
## 6 65.30 5.696
## Peak.Demand.Reduction.Indicator Carbon.Emission.Reduction.Category
## 1 0 Moderate Reduction
## 2 0 Moderate Reduction
## 3 0 Moderate Reduction
## 4 0 No Reduction
## 5 0 Moderate Reduction
## 6 0 Low Reduction
names(df)
## [1] "Timestamp" "Building.Type"
## [3] "Energy.Consumption..kWh." "Temperature"
## [5] "Humidity...." "Occupancy.Rate...."
## [7] "Lighting.Consumption..kWh." "HVAC.Consumption..kWh."
## [9] "Energy.Price....kWh." "Carbon.Emission.Rate..g.CO2.kWh."
## [11] "Power.Factor" "Voltage.Levels..V."
## [13] "Reactive.Power..kVARh." "Power.Outage.Indicator"
## [15] "Indoor.Temperature...C." "Building.Age..years."
## [17] "Equipment.Age..years." "Energy.Efficiency.Rating"
## [19] "Building.Size.m.2." "Window.to.Wall.Ratio...."
## [21] "Insulation.Quality.Score" "Historical.Energy.Consumption..kWh."
## [23] "Maintenance.Status" "Demand.Response.Participation"
## [25] "Occupancy.Schedule" "Local.Energy.Production..kWh."
## [27] "Grid.Stability.Score" "Solar.Irradiance"
## [29] "Smart.Plug.Usage..kWh." "Water.Usage..liters."
## [31] "Energy.Savings.Target...." "Room.Level.Energy.Consumption..kWh."
## [33] "Zonal.Heating.Cooling.Data..kWh." "Electric.Vehicle.Charging.Status"
## [35] "Building.Orientation" "IoT.Sensor.Count"
## [37] "Thermal.Comfort.Index" "Energy.Savings.Potential...."
## [39] "Peak.Demand.Reduction.Indicator" "Carbon.Emission.Reduction.Category"
# Load required libraries
library(lubridate)
# Convert Timestamp to proper datetime format
df$Timestamp = dmy_hm(df$Timestamp) # This handles dd/mm/yyyy HH:MM format
# Filter for full year 2023 and keep the 12 columns used in the analysis
df_2023 = df %>%
filter(Timestamp >= as.POSIXct("2023-01-01 00:00:00") &
Timestamp <= as.POSIXct("2023-12-31 23:59:59")) %>%
select(
'Timestamp',
'Building.Type',
'Energy.Consumption..kWh.',
'Temperature',
'Solar.Irradiance',
'HVAC.Consumption..kWh.',
'Lighting.Consumption..kWh.',
'Peak.Demand.Reduction.Indicator',
'Energy.Price....kWh.',
'Building.Age..years.',
'Building.Size.m.2.',
'Carbon.Emission.Reduction.Category'
)
# set seed to ensure reproducibility
set.seed(123)
final_df = df_2023 %>%
slice_sample(n = 2000)
# Save the filtered data to a new CSV file
write.csv(final_df, "energy_data_20231.csv", row.names = FALSE)
# Check the date range in the final dataset
range(final_df$Timestamp)
## [1] "2023-01-01 06:00:00 UTC" "2023-12-31 22:00:00 UTC"
# Check dimensions of the new dataset
dim(final_df)
## [1] 2000 12
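The filter above compares timestamps parsed by dmy_hm (which returns UTC) against as.POSIXct() bounds that default to the session's time zone. The following base-R sketch, assuming the timestamps are meant to be read as UTC, shows one way to keep both sides in the same zone so the 2023 window does not depend on the machine's locale. The sample strings are illustrative and are not rows from the dataset.

```r
# Illustrative timestamps in the dataset's d/mm/yyyy H:MM layout (not real rows)
raw <- c("31/12/2022 23:00", "1/01/2023 0:00", "31/12/2023 23:00", "1/01/2024 0:00")

# Parse explicitly as UTC
ts <- as.POSIXct(raw, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Build the window bounds in the same time zone, using a half-open interval
start_2023 <- as.POSIXct("2023-01-01 00:00:00", tz = "UTC")
end_2023   <- as.POSIXct("2024-01-01 00:00:00", tz = "UTC")

# TRUE only for timestamps that fall inside calendar year 2023
in_2023 <- ts >= start_2023 & ts < end_2023
print(in_2023)
```

The half-open upper bound avoids having to spell out 23:59:59 and still excludes midnight on 1 January 2024.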
# Load the sampled 2023 data
energy_data = read.csv("energy_data_20231.csv")
# Convert negative values to NA in columns that cannot be negative (Temperature and Peak.Demand.Reduction.Indicator are left as-is)
clean_data = energy_data %>%
mutate(
Energy.Consumption..kWh. = ifelse(Energy.Consumption..kWh. < 0, NA, Energy.Consumption..kWh.),
Solar.Irradiance = ifelse(Solar.Irradiance < 0, NA, Solar.Irradiance),
HVAC.Consumption..kWh. = ifelse(HVAC.Consumption..kWh. < 0, NA, HVAC.Consumption..kWh.),
Lighting.Consumption..kWh. = ifelse(Lighting.Consumption..kWh. < 0, NA, Lighting.Consumption..kWh.),
Energy.Price....kWh. = ifelse(Energy.Price....kWh. < 0, NA, Energy.Price....kWh.),
Building.Age..years. = ifelse(Building.Age..years. < 0, NA, Building.Age..years.),
Building.Size.m.2. = ifelse(Building.Size.m.2. < 0, NA, Building.Size.m.2.)
)
# Remove rows with any NA values
final_clean_data = clean_data %>%
na.omit()
# Ensure the output column names are consistent
final_clean_data = final_clean_data %>%
select(
Timestamp,
Building.Type,
Energy.Consumption..kWh.,
Temperature,
Solar.Irradiance,
HVAC.Consumption..kWh.,
Lighting.Consumption..kWh.,
Peak.Demand.Reduction.Indicator,
Energy.Price....kWh.,
Building.Age..years.,
Building.Size.m.2.,
Carbon.Emission.Reduction.Category
)
# Save the cleaned data
write.csv(final_clean_data, "energy_data_2023_clean.csv", row.names = FALSE)
# Check how many rows were removed
cat("Original number of rows:", nrow(energy_data), "\n")
## Original number of rows: 2000
cat("Number of rows after removing NA values:", nrow(final_clean_data), "\n")
## Number of rows after removing NA values: 1842
cat("Number of rows removed:", nrow(energy_data) - nrow(final_clean_data), "\n")
## Number of rows removed: 158
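To see which columns account for the removed rows, the negative values can be counted per column before dropping them. This is a minimal sketch on a toy data frame; the column names and values are hypothetical stand-ins, not the dataset's own.

```r
# Toy data frame with a few invalid negative readings (illustrative values only)
toy <- data.frame(
  energy_kwh = c(74.7, -3.2, 58.8, 53.6),
  size_m2    = c(545.8, 1308.2, -47.6, 348.6),
  temp_c     = c(31.4, -2.0, 19.2, 16.7)  # negative temperatures are valid
)

# Columns where a negative value is physically impossible
non_negative_cols <- c("energy_kwh", "size_m2")

# Count negatives per checked column
neg_counts <- sapply(toy[non_negative_cols], function(x) sum(x < 0, na.rm = TRUE))
print(neg_counts)

# Keep only rows with no negatives in the checked columns
keep <- rowSums(toy[non_negative_cols] < 0, na.rm = TRUE) == 0
toy_clean <- toy[keep, ]
print(nrow(toy_clean))
```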
# View summary statistics of numeric columns
summary(final_clean_data)
## Timestamp Building.Type Energy.Consumption..kWh. Temperature
## Length:1842 Length:1842 Min. : 1.47 Min. :-7.7
## Class :character Class :character 1st Qu.: 40.81 1st Qu.:14.4
## Mode :character Mode :character Median : 54.16 Median :21.1
## Mean : 54.30 Mean :21.1
## 3rd Qu.: 67.91 3rd Qu.:27.8
## Max. :119.42 Max. :62.7
## Solar.Irradiance HVAC.Consumption..kWh. Lighting.Consumption..kWh.
## Min. : 0.1 Min. : 0.07 Min. : 0.087
## 1st Qu.:151.9 1st Qu.:11.67 1st Qu.: 7.902
## Median :220.3 Median :16.60 Median :11.316
## Mean :223.9 Mean :16.35 Mean :11.320
## 3rd Qu.:292.1 3rd Qu.:20.87 3rd Qu.:14.615
## Max. :527.7 Max. :39.58 Max. :28.845
## Peak.Demand.Reduction.Indicator Energy.Price....kWh. Building.Age..years.
## Min. :0.000 Min. :0.002 Min. : 0.2
## 1st Qu.:0.000 1st Qu.:0.123 1st Qu.:15.0
## Median :0.000 Median :0.156 Median :21.6
## Mean :0.151 Mean :0.155 Mean :22.0
## 3rd Qu.:0.000 3rd Qu.:0.188 3rd Qu.:28.5
## Max. :1.000 Max. :0.331 Max. :51.7
## Building.Size.m.2. Carbon.Emission.Reduction.Category
## Min. : 23.9 Length:1842
## 1st Qu.: 805.0 Class :character
## Median :1158.7 Mode :character
## Mean :1156.6
## 3rd Qu.:1503.4
## Max. :2654.9
# Check dimensions of final dataset
dim(final_clean_data)
## [1] 1842 12
To understand the characteristics of the data and relationships between variables, we performed the following analyses:
# Descriptive statistics
summary(final_clean_data)
## Timestamp Building.Type Energy.Consumption..kWh. Temperature
## Length:1842 Length:1842 Min. : 1.47 Min. :-7.7
## Class :character Class :character 1st Qu.: 40.81 1st Qu.:14.4
## Mode :character Mode :character Median : 54.16 Median :21.1
## Mean : 54.30 Mean :21.1
## 3rd Qu.: 67.91 3rd Qu.:27.8
## Max. :119.42 Max. :62.7
## Solar.Irradiance HVAC.Consumption..kWh. Lighting.Consumption..kWh.
## Min. : 0.1 Min. : 0.07 Min. : 0.087
## 1st Qu.:151.9 1st Qu.:11.67 1st Qu.: 7.902
## Median :220.3 Median :16.60 Median :11.316
## Mean :223.9 Mean :16.35 Mean :11.320
## 3rd Qu.:292.1 3rd Qu.:20.87 3rd Qu.:14.615
## Max. :527.7 Max. :39.58 Max. :28.845
## Peak.Demand.Reduction.Indicator Energy.Price....kWh. Building.Age..years.
## Min. :0.000 Min. :0.002 Min. : 0.2
## 1st Qu.:0.000 1st Qu.:0.123 1st Qu.:15.0
## Median :0.000 Median :0.156 Median :21.6
## Mean :0.151 Mean :0.155 Mean :22.0
## 3rd Qu.:0.000 3rd Qu.:0.188 3rd Qu.:28.5
## Max. :1.000 Max. :0.331 Max. :51.7
## Building.Size.m.2. Carbon.Emission.Reduction.Category
## Min. : 23.9 Length:1842
## 1st Qu.: 805.0 Class :character
## Median :1158.7 Mode :character
## Mean :1156.6
## 3rd Qu.:1503.4
## Max. :2654.9
# Standard deviation for numeric columns
numeric_columns = final_clean_data %>%
select(where(is.numeric))
sapply(numeric_columns, sd, na.rm = TRUE)
## Energy.Consumption..kWh. Temperature
## 19.47877 10.00629
## Solar.Irradiance HVAC.Consumption..kWh.
## 100.21482 6.64073
## Lighting.Consumption..kWh. Peak.Demand.Reduction.Indicator
## 4.89607 0.35860
## Energy.Price....kWh. Building.Age..years.
## 0.05033 9.53429
## Building.Size.m.2.
## 492.26153
The dataset was divided into a training set (80%) and a testing set (20%) to ensure robust evaluation of model performance.
# Set seed for reproducibility
set.seed(123)
# Split the data into training (80%) and testing (20%) sets
train_indices = sample(1:nrow(final_clean_data), size = floor(0.8 * nrow(final_clean_data)))
train_data = final_clean_data[train_indices, ]
test_data = final_clean_data[-train_indices, ]
# Check dimensions of the splits
dim(train_data) # Training set
## [1] 1473 12
dim(test_data) # Testing set
## [1] 369 12
After splitting the dataset, the training set will be used to explore various modeling approaches, including simple and multiple linear regression, interaction terms, and transformations. The testing set will be used to evaluate model performance using metrics such as RMSE and R-squared, ensuring the model’s applicability to unseen data.
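The two evaluation metrics named above can be written as small helper functions. This is a sketch, and the names rmse and r_squared are introduced here for illustration; the numbers used to exercise them are made up.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Out-of-sample R-squared: 1 - SSE / SST, with SST taken around the mean of `actual`
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Small check on made-up numbers
actual    <- c(50, 60, 40, 55)
predicted <- c(52, 58, 43, 55)
print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

With a fitted model, the predictions would typically come from predict(model, newdata = test_data), compared against the observed response in the testing set.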
Histograms and density plots were created to examine the distribution of key variables. Boxplots were used to explore the relationship between the categorical variable Building.Type and the response variable Energy.Consumption..kWh.. Together, these plots give a first picture of the energy consumption dataset.
# Histogram
ggplot(final_clean_data, aes(x = Energy.Consumption..kWh.)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Energy Consumption", x = "Energy Consumption (kWh)", y = "Frequency")
The energy consumption data follows an approximately normal distribution with a slight right skew. The peak occurs around 50-60 kWh, and values range from about 1.5 kWh to 119 kWh. Most observations fall between 20 and 100 kWh, though a few show notably higher usage.
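A visual impression of skew can be backed by a number. The sketch below defines a moment-based sample skewness in base R (the function name sample_skewness is introduced here; base R has no built-in skewness function). Values near zero indicate symmetry and positive values indicate a longer right tail.

```r
# Moment-based sample skewness: mean of cubed deviations over sd^3 (population form)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

sym_skew   <- sample_skewness(c(1, 2, 3, 4, 5))    # symmetric data
right_skew <- sample_skewness(c(1, 1, 1, 2, 10))   # long right tail
print(sym_skew)
print(right_skew)
```

Applied to the response column, this would give a single figure to accompany the histogram.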
# Density plot for Temperature
ggplot(final_clean_data, aes(x = Temperature)) +
geom_density(fill = "green", alpha = 0.5) +
labs(title = "Density Plot of Temperature", x = "Temperature (°C)", y = "Density")
The temperature distribution shows a slight right skew, with values ranging from -7.7°C to 62.7°C and peaking at around 20°C. Most temperatures fall between 0°C and 40°C, though some higher readings are observed.
# Boxplot for Building Type vs Energy Consumption
ggplot(final_clean_data, aes(x = Building.Type, y = Energy.Consumption..kWh., fill = Building.Type)) +
geom_boxplot() +
labs(title = "Energy Consumption by Building Type", x = "Building Type", y = "Energy Consumption (kWh)")
The boxplot compares energy consumption across three building types: commercial, industrial, and residential. All three categories show similar median consumption of around 50-55 kWh, implying that typical energy consumption is much the same across building types. The spread of consumption is also comparable across types, though residential buildings exhibit more outliers at higher consumption levels. The interquartile ranges span roughly 40-70 kWh for all building types.
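The visual impression that the building types share a similar distribution can be checked with group medians and a rank-based test. This sketch uses simulated data in which all three groups come from the same distribution. The Kruskal-Wallis test is one possible choice and is not part of the original analysis.

```r
set.seed(1)
# Simulated stand-in for the cleaned data: three types drawn from the same distribution
sim <- data.frame(
  building_type = factor(rep(c("Commercial", "Industrial", "Residential"), each = 100)),
  energy_kwh    = rnorm(300, mean = 54, sd = 19)
)

# Group medians, the quantity the boxplot's centre lines show
group_medians <- tapply(sim$energy_kwh, sim$building_type, median)
print(round(group_medians, 1))

# Rank-based test of whether the groups share a common distribution
kw <- kruskal.test(energy_kwh ~ building_type, data = sim)
print(kw$p.value)
```

A large p-value would be consistent with the boxplot's message, while a small one would suggest a difference the plot does not make obvious.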
In the pairs plot, strong correlations should appear as a clear trend (e.g., points aligned along a line, sloping upwards for positive correlation).
However, in the scatterplots involving the response variable, the points appear to form random clouds rather than a strong trend.
The corresponding correlation coefficients (small values close to zero in the upper triangle) and the correlation matrix confirm that no variables show a strong correlation with the response variable.
library(faraway)
# Select numeric columns for analysis
numeric_columns = final_clean_data %>%
select(Energy.Consumption..kWh.,
Temperature,
Solar.Irradiance,
HVAC.Consumption..kWh.,
Lighting.Consumption..kWh.,
Energy.Price....kWh.,
Building.Age..years.,
Building.Size.m.2.)
# Custom panel.cor function to add correlation coefficients
panel.cor = function(x, y, digits = 2, prefix = "", cex.cor = 0.8, ...) {
usr = par("usr")
on.exit(par(usr = usr)) # restore the original plot coordinates on exit
par(usr = c(0, 1, 0, 1))
r = cor(x, y, use = "complete.obs")
txt = format(c(r, 0.123456789), digits = digits)[1]
txt = paste0(prefix, txt)
text(0.5, 0.5, txt, cex = cex.cor)
}
# Generate the pairs plot in the Results section
pairs(numeric_columns,
col = "dodgerblue",
pch = 16,
cex = 0.5,
gap = 0,
upper.panel = panel.cor,
lower.panel = panel.smooth)
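The correlations shown in the upper panels can also be listed directly, which is easier to scan than a grid of plots. The sketch below uses simulated data with hypothetical short column names; the response is generated independently of the predictors, so the correlations should be near zero.

```r
set.seed(42)
# Simulated stand-in: a response unrelated to its predictors
sim <- data.frame(
  energy = rnorm(500, 54, 19),
  temp   = rnorm(500, 21, 10),
  hvac   = rnorm(500, 16, 7),
  size   = rnorm(500, 1150, 490)
)

# Correlation of every predictor with the response, sorted by absolute size
cors <- cor(sim[, -1], sim$energy, use = "complete.obs")[, 1]
cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors, 3))
```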
The dataset was successfully split into training and testing sets to prepare for model building and validation:
- Training Set: 80% of the data, containing 1473 rows.
- Testing Set: 20% of the data, containing 369 rows.
The following histogram illustrates the distribution of
Energy.Consumption..kWh. in both the training and testing
sets. The distributions are similar, indicating that the split has
maintained the representativeness of the original dataset.
# Combine training and testing sets for visualization
train_data$Set = "Training Set"
test_data$Set = "Testing Set"
combined_data = rbind(train_data, test_data)
# Histogram comparison
ggplot(combined_data, aes(x = Energy.Consumption..kWh., fill = Set)) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity", color = "black") +
labs(title = "Energy Consumption Distribution in Training vs. Testing Sets",
x = "Energy Consumption (kWh)", y = "Frequency") +
scale_fill_manual(name = "Dataset", values = c("Training Set" = "lightblue", "Testing Set" = "red")) +
theme_minimal()
train_summary = train_data %>% summarise(across(where(is.numeric), list(mean = function(x) mean(x, na.rm = TRUE), sd = function(x) sd(x, na.rm = TRUE))))
test_summary = test_data %>% summarise(across(where(is.numeric), list(mean = function(x) mean(x, na.rm = TRUE), sd = function(x) sd(x, na.rm = TRUE))))
print(train_summary)
## Energy.Consumption..kWh._mean Energy.Consumption..kWh._sd Temperature_mean
## 1 54.2 19.53 21.05
## Temperature_sd Solar.Irradiance_mean Solar.Irradiance_sd
## 1 10.04 222.8 100.4
## HVAC.Consumption..kWh._mean HVAC.Consumption..kWh._sd
## 1 16.28 6.688
## Lighting.Consumption..kWh._mean Lighting.Consumption..kWh._sd
## 1 11.31 4.935
## Peak.Demand.Reduction.Indicator_mean Peak.Demand.Reduction.Indicator_sd
## 1 0.1385 0.3455
## Energy.Price....kWh._mean Energy.Price....kWh._sd Building.Age..years._mean
## 1 0.1538 0.05037 22.08
## Building.Age..years._sd Building.Size.m.2._mean Building.Size.m.2._sd
## 1 9.631 1163 490.5
print(test_summary)
## Energy.Consumption..kWh._mean Energy.Consumption..kWh._sd Temperature_mean
## 1 54.7 19.31 21.28
## Temperature_sd Solar.Irradiance_mean Solar.Irradiance_sd
## 1 9.877 228.1 99.38
## HVAC.Consumption..kWh._mean HVAC.Consumption..kWh._sd
## 1 16.63 6.452
## Lighting.Consumption..kWh._mean Lighting.Consumption..kWh._sd
## 1 11.37 4.742
## Peak.Demand.Reduction.Indicator_mean Peak.Demand.Reduction.Indicator_sd
## 1 0.2033 0.403
## Energy.Price....kWh._mean Energy.Price....kWh._sd Building.Age..years._mean
## 1 0.1594 0.04997 21.61
## Building.Age..years._sd Building.Size.m.2._mean Building.Size.m.2._sd
## 1 9.14 1130 499.1
The initial descriptive statistics provided insights into the central tendencies and variabilities of the key variables in the dataset. Specifically:
Energy Consumption (kWh): The mean energy consumption was approximately 54.3 kWh, with a standard deviation of 19.48 kWh, indicating moderate variability in energy usage across the analyzed buildings.
Temperature (°C): The average temperature was 21.1°C, with a standard deviation of 10.01°C. This highlights the diverse climate conditions in Southern California over the analyzed period.
HVAC and Lighting Consumption: These two predictors had means of 16.35 kWh and 11.32 kWh, respectively, with standard deviations of 6.64 kWh and 4.90 kWh, suggesting fairly consistent contributions to overall energy consumption.
These statistics provided a foundational understanding of the dataset and highlighted variables likely to have significant effects on energy consumption.
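Because the variables have different magnitudes, standard deviations alone are hard to compare. One option, shown here as a sketch, is the coefficient of variation (standard deviation divided by the mean), computed from the figures reported above. Temperature is left out because a ratio to the mean is not meaningful on the Celsius scale.

```r
# Means and standard deviations as reported in the summaries above
stats <- data.frame(
  variable = c("Energy", "HVAC", "Lighting"),
  mean     = c(54.30, 16.35, 11.320),
  sd       = c(19.47877, 6.64073, 4.89607)
)

# Coefficient of variation: sd relative to the mean (unitless)
stats$cv <- stats$sd / stats$mean
print(stats)
```

On this relative scale, the three consumption variables vary by a broadly similar amount, roughly 36% to 43% of their means.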
The training and testing sets were split with an 80/20 ratio,
maintaining the overall distribution of
Energy.Consumption..kWh.. Consistency in means and standard
deviations across both sets confirms their representativeness. This
ensures the training set provides robust data for model building, while
the testing set allows unbiased performance evaluation, creating a solid
foundation for predictive modeling.
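Comparing means and standard deviations is one check of representativeness. A two-sample Kolmogorov-Smirnov test compares the whole distributions. The sketch below runs it on a simulated response split 80/20; it is offered as an optional complement and was not part of the original analysis.

```r
set.seed(123)
# Simulated stand-in for the response, split 80/20 at random
y <- rnorm(1842, mean = 54, sd = 19)
train_idx <- sample(seq_along(y), size = floor(0.8 * length(y)))

# Two-sample Kolmogorov-Smirnov test: a small p-value would signal different distributions
ks <- ks.test(y[train_idx], y[-train_idx])
print(ks$p.value)
```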
Based on our correlation analysis from earlier sections, we’ll begin by exploring the relationship between Energy Consumption and HVAC Consumption using simple linear regression. This will help us understand the baseline relationship before moving to more complex models.
# Fit SLR model using HVAC consumption as predictor
slr_model = lm(Energy.Consumption..kWh. ~ HVAC.Consumption..kWh., data = train_data)
summary(slr_model)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ HVAC.Consumption..kWh.,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.16 -13.41 -0.14 13.51 64.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.1341 1.3398 41.15 <2e-16 ***
## HVAC.Consumption..kWh. -0.0575 0.0761 -0.75 0.45
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared: 0.000387, Adjusted R-squared: -0.000292
## F-statistic: 0.57 on 1 and 1471 DF, p-value: 0.45
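The simple linear regression results show a surprisingly weak relationship between HVAC consumption and total energy consumption: the slope estimate is -0.0575 (p = 0.45) and R-squared is 0.000387, so HVAC consumption explains almost none of the variation in total consumption.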
Let’s visualize this relationship:
# Plot the relationship
ggplot(train_data, aes(x = HVAC.Consumption..kWh., y = Energy.Consumption..kWh.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Energy Consumption vs HVAC Consumption",
x = "HVAC Consumption (kWh)",
y = "Energy Consumption (kWh)")
The scatter plot confirms our statistical findings, showing a nearly flat regression line and widely scattered points, suggesting that the relationship between HVAC consumption and total energy consumption is not linear, or that other factors may be more important in determining total energy consumption.
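In simple linear regression, R-squared equals the squared Pearson correlation between predictor and response, which ties the regression output back to the correlation analysis. The sketch below verifies the identity on simulated data and then converts the reported R-squared of 0.000387 into the implied correlation magnitude.

```r
set.seed(7)
# Simulated stand-in: response independent of the predictor
x <- rnorm(1473, 16, 7)
y <- rnorm(1473, 54, 19)

fit <- lm(y ~ x)
r2  <- summary(fit)$r.squared

# In simple linear regression, R-squared is the squared Pearson correlation
print(all.equal(r2, cor(x, y)^2))

# Correlation magnitude implied by the reported R-squared of 0.000387
implied_r <- sqrt(0.000387)
print(round(implied_r, 4))
```

An implied correlation of about 0.02 in absolute value matches the near-zero coefficients seen in the pairs plot.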
After examining HVAC consumption, we’ll investigate the relationship between temperature and energy consumption, as temperature is often considered a key driver of building energy use through its impact on heating and cooling needs.
# SLR with Temperature
slr_temp = lm(Energy.Consumption..kWh. ~ Temperature, data = train_data)
summary(slr_temp)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.57 -13.44 -0.15 13.42 65.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.1491 1.1821 46.65 <2e-16 ***
## Temperature -0.0452 0.0507 -0.89 0.37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared: 0.000539, Adjusted R-squared: -0.00014
## F-statistic: 0.794 on 1 and 1471 DF, p-value: 0.373
# Plot Temperature relationship
ggplot(train_data, aes(x = Temperature, y = Energy.Consumption..kWh.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Energy Consumption vs Temperature",
x = "Temperature (°C)",
y = "Energy Consumption (kWh)")
The temperature-based SLR model shows a similarly weak relationship: the slope is -0.0452 (p = 0.37) and R-squared is 0.000539.
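A confidence interval makes the uncertainty around the temperature slope concrete. This sketch rebuilds a 95% interval from the estimate, standard error, and residual degrees of freedom printed in the summary above; because the inputs are rounded, the result is approximate.

```r
# Estimate, standard error and residual df copied from the summary output above
estimate <- -0.0452
std_err  <- 0.0507
df_resid <- 1471

# 95% confidence interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_err
print(round(ci, 4))
```

The interval spans zero, in line with the non-significant p-value. Calling confint(slr_temp) on the fitted model would give the same interval without rounding error.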
Next, we’ll examine how building size relates to energy consumption, as larger buildings might be expected to consume more energy for lighting, heating, and cooling.
# SLR with Building Size
slr_size = lm(Energy.Consumption..kWh. ~ Building.Size.m.2., data = train_data)
summary(slr_size)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Building.Size.m.2., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.73 -13.17 -0.46 13.45 64.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.77997 1.30804 43.41 <2e-16 ***
## Building.Size.m.2. -0.00222 0.00104 -2.14 0.032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared: 0.00311, Adjusted R-squared: 0.00243
## F-statistic: 4.59 on 1 and 1471 DF, p-value: 0.0324
# Plot Building Size relationship
ggplot(train_data, aes(x = Building.Size.m.2., y = Energy.Consumption..kWh.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Energy Consumption vs Building Size",
x = "Building Size (m²)",
y = "Energy Consumption (kWh)")
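The building size model reveals a slope of -0.00222 kWh per m² (p = 0.032). The effect is statistically significant at the 5% level, but it is negative and practically negligible, with an R-squared of only 0.00311.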
Finally, we’ll analyze the relationship between lighting consumption and total energy consumption, as lighting is typically a significant component of building energy use.
# SLR with Lighting Consumption
slr_light = lm(Energy.Consumption..kWh. ~ Lighting.Consumption..kWh., data = train_data)
summary(slr_light)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Lighting.Consumption..kWh.,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.52 -13.37 -0.26 13.50 65.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.7659 1.2725 42.25 <2e-16 ***
## Lighting.Consumption..kWh. 0.0382 0.1031 0.37 0.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared: 9.35e-05, Adjusted R-squared: -0.000586
## F-statistic: 0.138 on 1 and 1471 DF, p-value: 0.711
# Plot Lighting Consumption relationship
ggplot(train_data, aes(x = Lighting.Consumption..kWh., y = Energy.Consumption..kWh.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Energy Consumption vs Lighting Consumption",
x = "Lighting Consumption (kWh)",
y = "Energy Consumption (kWh)")
The lighting consumption model shows no detectable relationship: the slope is 0.0382 (p = 0.71) and R-squared is 9.35e-05.
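After examining four different predictors (HVAC, Temperature, Building Size, and Lighting Consumption) through simple linear regression, only Building Size reaches significance at the 5% level (p = 0.032), and no model explains more than about 0.3% of the variation in energy consumption (R-squared of at most 0.00311).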
These findings strongly suggest we should proceed with multiple regression analysis and consider non-linear relationships or interactions between variables.
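The four single-predictor fits can be produced and tabulated in one pass, which makes the comparison easier to read. The sketch below does this on simulated data with hypothetical short column names; the same loop would apply to the training data with the real column names substituted.

```r
set.seed(99)
# Simulated stand-in for train_data with shortened, hypothetical column names
sim <- data.frame(
  energy = rnorm(300, 54, 19),
  hvac   = rnorm(300, 16, 7),
  temp   = rnorm(300, 21, 10),
  size   = rnorm(300, 1150, 490),
  light  = rnorm(300, 11, 5)
)

predictors <- c("hvac", "temp", "size", "light")

# Fit one simple regression per predictor and collect slope, p-value and R-squared
slr_table <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "energy"), data = sim)
  s   <- summary(fit)
  data.frame(
    predictor = p,
    slope     = coef(fit)[[2]],
    p_value   = s$coefficients[2, "Pr(>|t|)"],
    r_squared = s$r.squared
  )
}))
print(slr_table)
```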
Given the poor performance of the simple linear regression models, let’s extend our analysis to include multiple predictors that might better explain the variation in energy consumption.
We’ll start by including all relevant numeric predictors to see their combined effect on energy consumption:
# Fit full MLR model
mlr_full = lm(Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance +
HVAC.Consumption..kWh. + Lighting.Consumption..kWh. +
Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
Peak.Demand.Reduction.Indicator, data = train_data)
summary(mlr_full)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance +
## HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.07 -13.39 -0.26 13.34 64.43
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.66241 3.37233 17.40 <2e-16 ***
## Temperature -0.03829 0.05084 -0.75 0.45
## Solar.Irradiance -0.00320 0.00508 -0.63 0.53
## HVAC.Consumption..kWh. -0.05800 0.07622 -0.76 0.45
## Lighting.Consumption..kWh. 0.02953 0.10344 0.29 0.78
## Energy.Price....kWh. -1.27263 10.12715 -0.13 0.90
## Building.Age..years. 0.03258 0.05289 0.62 0.54
## Building.Size.m.2. -0.00226 0.00104 -2.18 0.03 *
## Peak.Demand.Reduction.Indicator -1.61705 1.47761 -1.09 0.27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.5 on 1464 degrees of freedom
## Multiple R-squared: 0.00547, Adjusted R-squared: 3.39e-05
## F-statistic: 1.01 on 8 and 1464 DF, p-value: 0.429
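The full multiple regression model results show little improvement: R-squared is 0.00547, adjusted R-squared is 3.39e-05, and the overall F-test is not significant (p = 0.429). Building Size (p = 0.03) is the only predictor significant at the 5% level.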
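The overall F-statistic can be recovered from R-squared, the number of predictors k, and the residual degrees of freedom as F = (R²/k) / ((1 - R²)/df). The sketch below applies this to the rounded figures in the summary output above, so small discrepancies from the printed values are expected.

```r
# Values copied from the summary output above
r2 <- 0.00547   # multiple R-squared
k  <- 8         # number of predictors
df_resid <- 1464

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
p_val  <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(round(f_stat, 2))
print(round(p_val, 3))
```

An F-statistic close to 1 means the eight predictors together explain about as much variation as would be expected by chance.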
Before making any conclusions about the model, we should check for multicollinearity among predictors:
library(car)
vif(mlr_full)
## Temperature Solar.Irradiance
## 1.006 1.006
## HVAC.Consumption..kWh. Lighting.Consumption..kWh.
## 1.003 1.006
## Energy.Price....kWh. Building.Age..years.
## 1.005 1.002
## Building.Size.m.2. Peak.Demand.Reduction.Indicator
## 1.005 1.007
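The VIF analysis shows that every value lies between 1.002 and 1.007, far below the commonly used thresholds of 5 or 10, so multicollinearity is not a concern.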
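For reference, the variance inflation factor of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated predictors; x3 is deliberately built from x1 to show what an inflated value looks like. The variable names are hypothetical.

```r
set.seed(5)
# Simulated predictors: x3 is deliberately built from x1 to show an inflated VIF
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the other predictors
vif_manual <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A VIF of about 1, as reported for every predictor in this analysis, corresponds to an R²_j near zero: each predictor is essentially unrelated to the rest.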
Despite the low VIF values, let’s try a reduced model focusing on the most theoretically relevant predictors:
mlr_reduced = lm(Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance +
HVAC.Consumption..kWh. + Energy.Price....kWh. +
Building.Age..years. + Peak.Demand.Reduction.Indicator,
data = train_data)
vif(mlr_reduced)
## Temperature Solar.Irradiance
## 1.004 1.002
## HVAC.Consumption..kWh. Energy.Price....kWh.
## 1.002 1.002
## Building.Age..years. Peak.Demand.Reduction.Indicator
## 1.002 1.005
The reduced model maintains low VIF values, confirming the absence of multicollinearity.
Next, let’s examine whether our model is being affected by influential observations:
# Calculate Cook's distance
cooks_d = cooks.distance(mlr_reduced)
plot(cooks_d, type = "h", main = "Cook's Distance Plot")
abline(h = 4/length(cooks_d), col = "red")
# Identify influential points
influential = which(cooks_d > 4/length(cooks_d))
length(influential)
## [1] 65
The Cook’s distance plot reveals 65 observations above the 4/n threshold (red line), about 4.4% of the 1473 training rows.
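The 4/n cutoff is a rule of thumb, and it helps to see how many points it flags even when nothing unusual is present. The sketch below applies the same rule to a simulated model with well-behaved noise, then computes the cutoff and the flagged share for the figures reported in this analysis.

```r
set.seed(11)
# Simulated stand-in for a fitted model on the training data
n <- 1473
x <- rnorm(n)
y <- 54 + rnorm(n, sd = 19)
fit <- lm(y ~ x)

# Rule-of-thumb cutoff used in the text: 4 / n
cutoff  <- 4 / n
flagged <- which(cooks.distance(fit) > cutoff)

print(round(cutoff, 5))
print(length(flagged) / n)   # share of simulated observations flagged

# Share reported in the analysis: 65 of 1473
reported_share <- 65 / n
print(round(reported_share, 3))
```

If the simulated share turns out similar to the reported one, the flagged points may be ordinary tail observations, not anomalies.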
Let’s refit the full model without these influential points (note that they were identified using the reduced model):
# Fit model without influential points
mlr_no_influential = lm(Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance +
HVAC.Consumption..kWh. + Lighting.Consumption..kWh. +
Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
Peak.Demand.Reduction.Indicator,
data = train_data[-influential,])
summary(mlr_no_influential)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance +
## HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator,
## data = train_data[-influential, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.27 -12.50 0.03 12.63 59.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.329752 3.149725 17.88 <2e-16 ***
## Temperature -0.010208 0.047955 -0.21 0.831
## Solar.Irradiance -0.004330 0.004742 -0.91 0.361
## HVAC.Consumption..kWh. 0.018131 0.070710 0.26 0.798
## Lighting.Consumption..kWh. -0.014866 0.094863 -0.16 0.875
## Energy.Price....kWh. -1.014870 9.437111 -0.11 0.914
## Building.Age..years. 0.059129 0.049599 1.19 0.233
## Building.Size.m.2. -0.001787 0.000958 -1.87 0.062 .
## Peak.Demand.Reduction.Indicator -2.675955 1.393900 -1.92 0.055 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.6 on 1399 degrees of freedom
## Multiple R-squared: 0.00664, Adjusted R-squared: 0.000959
## F-statistic: 1.17 on 8 and 1399 DF, p-value: 0.314
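After removing the influential points and refitting with all eight predictors, the fit barely changes: the multiple R-squared is 0.00664 (adjusted 0.000959) and the overall F-test is still not significant (p = 0.314). No predictor is significant at the 5% level; Building.Size.m.2. (p = 0.062) and Peak.Demand.Reduction.Indicator (p = 0.055) are only marginal.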
Let’s check the assumptions of our model:
# Check normality assumption
qqnorm(resid(mlr_no_influential))
qqline(resid(mlr_no_influential))
# Check constant variance
plot(fitted(mlr_no_influential), resid(mlr_no_influential),
xlab = "Fitted values", ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")
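The diagnostic plots are used to judge whether the residuals are approximately normal and have constant variance.
To check whether a transformation of the response would help, we run a Box-Cox analysis: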
# Perform Box-Cox transformation analysis
library(MASS)
bc = boxcox(mlr_no_influential)
lambda = bc$x[which.max(bc$y)]
print(lambda)
## [1] 0.9091
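Since the estimated lambda (0.91) is close to 1, the Box-Cox analysis suggests that no transformation of the response is needed. For comparison, we still fit a log-transformed model (which corresponds to lambda = 0) and check which of the two performs better.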
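One common way to judge whether an estimated lambda is "close to 1" or "close to 0" is the approximate 95% interval from the profile log-likelihood that boxcox() returns. The sketch below shows this on simulated data where the log scale is the correct one, so lambda should land near 0. The data and object names (bc_sim, lambda_hat, lambda_ci) are assumptions for illustration only.

```r
library(MASS)  # provides boxcox()

# Simulated response that is linear on the log scale (hypothetical data)
set.seed(3)
n = 500
x = runif(n)
y = exp(1 + 1 * x + rnorm(n, sd = 0.5))
fit = lm(y ~ x, y = TRUE)         # keep the response in the fit for boxcox()

bc_sim = boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat = bc_sim$x[which.max(bc_sim$y)]

# Approximate 95% interval: all lambdas whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum
in_ci = bc_sim$y > max(bc_sim$y) - qchisq(0.95, df = 1) / 2
lambda_ci = range(bc_sim$x[in_ci])

print(lambda_hat)
print(lambda_ci)
```

Applied to our model, the same check would ask whether 1 (no transformation) or 0 (log) falls inside the interval around 0.91.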
# Log model
mlr_log = lm(log(Energy.Consumption..kWh.) ~ Temperature + Solar.Irradiance +
HVAC.Consumption..kWh. + Lighting.Consumption..kWh. +
Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
Peak.Demand.Reduction.Indicator,
data = train_data[-influential,])
summary(mlr_log)
##
## Call:
## lm(formula = log(Energy.Consumption..kWh.) ~ Temperature + Solar.Irradiance +
## HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator,
## data = train_data[-influential, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2325 -0.1987 0.0616 0.2731 0.7975
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9796932 0.0677435 58.75 <2e-16 ***
## Temperature -0.0001270 0.0010314 -0.12 0.902
## Solar.Irradiance -0.0001093 0.0001020 -1.07 0.284
## HVAC.Consumption..kWh. -0.0000576 0.0015208 -0.04 0.970
## Lighting.Consumption..kWh. -0.0004536 0.0020403 -0.22 0.824
## Energy.Price....kWh. -0.0153366 0.2029711 -0.08 0.940
## Building.Age..years. 0.0013022 0.0010668 1.22 0.222
## Building.Size.m.2. -0.0000377 0.0000206 -1.83 0.068 .
## Peak.Demand.Reduction.Indicator -0.0252638 0.0299797 -0.84 0.400
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.379 on 1399 degrees of freedom
## Multiple R-squared: 0.00478, Adjusted R-squared: -0.000907
## F-statistic: 0.841 on 8 and 1399 DF, p-value: 0.567
# Check diagnostics for both models
# Original model diagnostics
par(mfrow=c(1,2))
plot(mlr_no_influential, which=1)
plot(mlr_no_influential, which=2)
# Log transformed model diagnostics
par(mfrow=c(1,2))
plot(mlr_log, which=1)
plot(mlr_log, which=2)
# Formal tests for both models
library(lmtest)
# Original model tests
shapiro.test(resid(mlr_no_influential))
##
## Shapiro-Wilk normality test
##
## data: resid(mlr_no_influential)
## W = 1, p-value = 0.007
bptest(mlr_no_influential)
##
## studentized Breusch-Pagan test
##
## data: mlr_no_influential
## BP = 17, df = 8, p-value = 0.03
# Log transformed model tests
shapiro.test(resid(mlr_log))
##
## Shapiro-Wilk normality test
##
## data: resid(mlr_log)
## W = 0.94, p-value <2e-16
bptest(mlr_log)
##
## studentized Breusch-Pagan test
##
## data: mlr_log
## BP = 12, df = 8, p-value = 0.2
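Comparing the Breusch-Pagan and Shapiro-Wilk tests together with the residuals-vs-fitted and Q-Q plots: the log model no longer rejects constant variance (BP p = 0.2, versus p = 0.03 for the original), but its residuals depart much more strongly from normality (W = 0.94, p < 2e-16, versus p = 0.007), and its overall F-test is weaker (p = 0.567 versus p = 0.314). On balance, and consistent with the Box-Cox estimate near 1, we conclude that the original model without influential points is the better of the two.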
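To show what the Breusch-Pagan statistic is built from, here is a base-R sketch of the studentized (Koenker) form, which as far as we understand is what bptest() reports by default (its output above is labelled "studentized Breusch-Pagan test"). The data are simulated with an error spread that grows with x, and all object names are hypothetical.

```r
# Simulated heteroscedastic data (hypothetical)
set.seed(4)
n = 500
x = runif(n, 1, 10)
y = 2 + 3 * x + rnorm(n, sd = x)  # error standard deviation grows with x
fit = lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictors.
# Studentized statistic = n * R^2 of the auxiliary regression,
# compared against a chi-squared with (number of predictors) df.
aux = lm(resid(fit)^2 ~ x)
bp_stat = n * summary(aux)$r.squared
bp_df   = length(coef(aux)) - 1
bp_p    = pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p = bp_p))
```

A small p-value, as in this simulated case, means the squared residuals are predictable from the regressors, which is the sense in which the original model "fails" the test at p = 0.03.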
# Test interactions between temperature and HVAC consumption
# (since temperature likely affects HVAC usage)
mlr_interaction1 = lm(Energy.Consumption..kWh. ~ Temperature * HVAC.Consumption..kWh. +
Solar.Irradiance + Lighting.Consumption..kWh. +
Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
Peak.Demand.Reduction.Indicator,
data = train_data[-influential,])
# Test interactions between temperature and building size
# (since larger buildings might be more affected by temperature changes)
mlr_interaction2 = lm(Energy.Consumption..kWh. ~ Temperature * Building.Size.m.2. +
HVAC.Consumption..kWh. + Solar.Irradiance +
Lighting.Consumption..kWh. + Energy.Price....kWh. +
Building.Age..years. + Peak.Demand.Reduction.Indicator,
data = train_data[-influential,])
# Test interactions between solar irradiance and HVAC consumption
# (since solar heat might affect HVAC needs)
mlr_interaction3 = lm(Energy.Consumption..kWh. ~ Solar.Irradiance * HVAC.Consumption..kWh. +
Temperature + Lighting.Consumption..kWh. +
Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
Peak.Demand.Reduction.Indicator,
data = train_data[-influential,])
# Compare models using anova
anova(mlr_no_influential, mlr_interaction1)
## Analysis of Variance Table
##
## Model 1: Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + HVAC.Consumption..kWh. +
## Lighting.Consumption..kWh. + Energy.Price....kWh. + Building.Age..years. +
## Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Model 2: Energy.Consumption..kWh. ~ Temperature * HVAC.Consumption..kWh. +
## Solar.Irradiance + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1399 433357
## 2 1398 433308 1 48.8 0.16 0.69
anova(mlr_no_influential, mlr_interaction2)
## Analysis of Variance Table
##
## Model 1: Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + HVAC.Consumption..kWh. +
## Lighting.Consumption..kWh. + Energy.Price....kWh. + Building.Age..years. +
## Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Model 2: Energy.Consumption..kWh. ~ Temperature * Building.Size.m.2. +
## HVAC.Consumption..kWh. + Solar.Irradiance + Lighting.Consumption..kWh. +
## Energy.Price....kWh. + Building.Age..years. + Peak.Demand.Reduction.Indicator
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1399 433357
## 2 1398 433329 1 28 0.09 0.76
anova(mlr_no_influential, mlr_interaction3)
## Analysis of Variance Table
##
## Model 1: Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + HVAC.Consumption..kWh. +
## Lighting.Consumption..kWh. + Energy.Price....kWh. + Building.Age..years. +
## Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Model 2: Energy.Consumption..kWh. ~ Solar.Irradiance * HVAC.Consumption..kWh. +
## Temperature + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1399 433357
## 2 1398 433282 1 75 0.24 0.62
# Summary of each interaction model
summary(mlr_interaction1)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature * HVAC.Consumption..kWh. +
## Solar.Irradiance + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator,
## data = train_data[-influential, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.31 -12.55 -0.11 12.68 59.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.230319 3.882802 14.74 <2e-16 ***
## Temperature -0.052732 0.117398 -0.45 0.653
## HVAC.Consumption..kWh. -0.038111 0.158389 -0.24 0.810
## Solar.Irradiance -0.004364 0.004744 -0.92 0.358
## Lighting.Consumption..kWh. -0.015065 0.094893 -0.16 0.874
## Energy.Price....kWh. -0.993956 9.440101 -0.11 0.916
## Building.Age..years. 0.058610 0.049631 1.18 0.238
## Building.Size.m.2. -0.001795 0.000958 -1.87 0.061 .
## Peak.Demand.Reduction.Indicator -2.679632 1.394351 -1.92 0.055 .
## Temperature:HVAC.Consumption..kWh. 0.002731 0.006881 0.40 0.692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.6 on 1398 degrees of freedom
## Multiple R-squared: 0.00675, Adjusted R-squared: 0.000357
## F-statistic: 1.06 on 9 and 1398 DF, p-value: 0.393
summary(mlr_interaction2)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature * Building.Size.m.2. +
## HVAC.Consumption..kWh. + Solar.Irradiance + Lighting.Consumption..kWh. +
## Energy.Price....kWh. + Building.Age..years. + Peak.Demand.Reduction.Indicator,
## data = train_data[-influential, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.3 -12.4 0.0 12.7 59.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.0653824 3.9889575 14.31 <2e-16 ***
## Temperature -0.0460324 0.1284298 -0.36 0.720
## Building.Size.m.2. -0.0024316 0.0023477 -1.04 0.301
## HVAC.Consumption..kWh. 0.0176137 0.0707536 0.25 0.803
## Solar.Irradiance -0.0042491 0.0047510 -0.89 0.371
## Lighting.Consumption..kWh. -0.0143885 0.0949073 -0.15 0.880
## Energy.Price....kWh. -1.0414412 9.4405938 -0.11 0.912
## Building.Age..years. 0.0597152 0.0496530 1.20 0.229
## Peak.Demand.Reduction.Indicator -2.6699617 1.3944959 -1.91 0.056 .
## Temperature:Building.Size.m.2. 0.0000303 0.0001008 0.30 0.764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.6 on 1398 degrees of freedom
## Multiple R-squared: 0.0067, Adjusted R-squared: 0.000309
## F-statistic: 1.05 on 9 and 1398 DF, p-value: 0.399
summary(mlr_interaction3)
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Solar.Irradiance * HVAC.Consumption..kWh. +
## Temperature + Lighting.Consumption..kWh. + Energy.Price....kWh. +
## Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator,
## data = train_data[-influential, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.30 -12.48 0.01 12.64 59.97
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.629967 4.112461 14.01 <2e-16
## Solar.Irradiance -0.010259 0.012953 -0.79 0.428
## HVAC.Consumption..kWh. -0.059565 0.173057 -0.34 0.731
## Temperature -0.010577 0.047974 -0.22 0.826
## Lighting.Consumption..kWh. -0.014596 0.094890 -0.15 0.878
## Energy.Price....kWh. -1.177472 9.445454 -0.12 0.901
## Building.Age..years. 0.059189 0.049612 1.19 0.233
## Building.Size.m.2. -0.001773 0.000959 -1.85 0.065
## Peak.Demand.Reduction.Indicator -2.687065 1.394461 -1.93 0.054
## Solar.Irradiance:HVAC.Consumption..kWh. 0.000358 0.000728 0.49 0.623
##
## (Intercept) ***
## Solar.Irradiance
## HVAC.Consumption..kWh.
## Temperature
## Lighting.Consumption..kWh.
## Energy.Price....kWh.
## Building.Age..years.
## Building.Size.m.2. .
## Peak.Demand.Reduction.Indicator .
## Solar.Irradiance:HVAC.Consumption..kWh.
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.6 on 1398 degrees of freedom
## Multiple R-squared: 0.00681, Adjusted R-squared: 0.000417
## F-statistic: 1.07 on 9 and 1398 DF, p-value: 0.386
# Temperature x HVAC interaction plot
ggplot(train_data[-influential,],
aes(x = Temperature, y = Energy.Consumption..kWh., color = HVAC.Consumption..kWh.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Interaction between Temperature and HVAC Consumption",
x = "Temperature",
y = "Energy Consumption (kWh)",
color = "HVAC Consumption (kWh)")
# Temperature x Building Size interaction plot
ggplot(train_data[-influential,],
aes(x = Temperature, y = Energy.Consumption..kWh., color = Building.Size.m.2.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Interaction between Temperature and Building Size",
x = "Temperature",
y = "Energy Consumption (kWh)",
color = "Building Size (m²)")
# Solar Irradiance x HVAC interaction plot
ggplot(train_data[-influential,],
aes(x = Solar.Irradiance, y = Energy.Consumption..kWh., color = HVAC.Consumption..kWh.)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
labs(title = "Interaction between Solar Irradiance and HVAC Consumption",
x = "Solar Irradiance",
y = "Energy Consumption (kWh)",
color = "HVAC Consumption (kWh)")
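Our analysis of the interaction models showed that none of the tested interactions is statistically significant: Temperature x HVAC consumption (p = 0.692), Temperature x building size (p = 0.764), and solar irradiance x HVAC consumption (p = 0.623).
ANOVA tests comparing the models with and without each interaction likewise showed no significant improvement (p = 0.69, 0.76, and 0.62, respectively).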
The R-squared values remained consistently low (approximately 0.006-0.007) across all interaction models, indicating poor explanatory power.
The visualization of these relationships revealed:
These findings suggest that adding interaction terms did not enhance the model’s predictive capabilities. There are several possible interpretations: the variables may be truly independent of each other, they may be related in a non-linear fashion, or they may be influenced by unmeasured factors not present in our dataset.
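The ANOVA p-values and the t-test p-values for the interaction terms agree because, when two nested models differ by a single term, the ANOVA F statistic equals the square of that term's t statistic (for example, 0.40^2 = 0.16 for Temperature x HVAC). A minimal sketch on simulated data, with hypothetical variable names:

```r
# Simulated data with no true interaction (hypothetical)
set.seed(5)
n  = 300
x1 = rnorm(n)
x2 = rnorm(n)
y  = 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
sim = data.frame(y, x1, x2)

fit_add = lm(y ~ x1 + x2, data = sim)   # additive model
fit_int = lm(y ~ x1 * x2, data = sim)   # adds the x1:x2 interaction

cmp     = anova(fit_add, fit_int)
f_stat  = cmp$F[2]
p_anova = cmp[2, "Pr(>F)"]
t_stat  = summary(fit_int)$coefficients["x1:x2", "t value"]
p_t     = summary(fit_int)$coefficients["x1:x2", "Pr(>|t|)"]

print(c(F = f_stat, t_squared = t_stat^2))
print(c(p_anova = p_anova, p_t = p_t))
```

So for single-term comparisons like ours, the ANOVA tables and the coefficient tables are two views of the same test rather than independent evidence.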
At this point, the original model without influential points is still the best candidate. Next, we use stepwise selection, polynomial regression, and exhaustive search to look for a better predictive model.
# 1. Load required libraries
library(leaps) # regsubsets() for the exhaustive search
# 2. Create stepwise model (stepAIC() comes from MASS, loaded earlier)
step_model = stepAIC(mlr_no_influential, direction = "both", trace = FALSE)
# 3. Create polynomial model
poly_model = lm(Energy.Consumption..kWh. ~
poly(Temperature, 2) +
poly(Solar.Irradiance, 2) +
poly(HVAC.Consumption..kWh., 2) +
poly(Building.Size.m.2., 2) +
Building.Age..years. +
Peak.Demand.Reduction.Indicator,
data = train_data[-influential,])
# 4. Create exhaustive search model
# Prepare data for exhaustive search
predictors = c("Temperature", "Solar.Irradiance", "HVAC.Consumption..kWh.",
"Lighting.Consumption..kWh.", "Building.Size.m.2.",
"Building.Age..years.", "Peak.Demand.Reduction.Indicator")
# Perform exhaustive search
exhaustive = regsubsets(Energy.Consumption..kWh. ~ .,
data = train_data[-influential, c("Energy.Consumption..kWh.", predictors)],
nvmax = length(predictors),
method = "exhaustive")
# Get summary of exhaustive search
exhaust_summary = summary(exhaustive)
# Find best model by adjusted R²
best_adjr2_idx = which.max(exhaust_summary$adjr2)
# Create best exhaustive model
best_vars = names(coef(exhaustive, best_adjr2_idx))[-1] # Remove intercept
formula_str = paste("Energy.Consumption..kWh. ~", paste(best_vars, collapse = " + "))
exhaustive_model = lm(as.formula(formula_str), data = train_data[-influential,])
# 5. Function to get model performance metrics
get_model_metrics = function(model, test_data) {
# Training metrics
train_resid = resid(model)
train_rmse = sqrt(mean(train_resid^2))
r2 = summary(model)$r.squared
adj_r2 = summary(model)$adj.r.squared
# Test metrics
test_pred = predict(model, newdata = test_data)
test_resid = test_data$Energy.Consumption..kWh. - test_pred
test_rmse = sqrt(mean(test_resid^2))
# Model diagnostics
shapiro_test = shapiro.test(train_resid)
bp_test = bptest(model)
return(c(
Train_RMSE = train_rmse,
Test_RMSE = test_rmse,
R_squared = r2,
Adj_R_squared = adj_r2,
AIC = AIC(model),
BIC = BIC(model),
Shapiro_p = shapiro_test$p.value,
BP_p = bp_test$p.value
))
}
# 6. Compare all models including exhaustive
models = list(
Original = mlr_no_influential,
Stepwise = step_model,
Polynomial = poly_model,
Exhaustive = exhaustive_model
)
model_metrics = t(sapply(models, get_model_metrics, test_data = test_data))
# Print formatted results
print("Model Comparison Results:")
## [1] "Model Comparison Results:"
print(round(model_metrics, 4))
## Train_RMSE Test_RMSE R_squared Adj_R_squared AIC BIC Shapiro_p
## Original 17.54 19.40 0.0066 0.0010 12083 12135 0.0067
## Stepwise 17.56 19.39 0.0049 0.0035 12073 12094 0.0091
## Polynomial 17.53 19.40 0.0079 0.0008 12085 12148 0.0060
## Exhaustive 17.55 19.41 0.0059 0.0038 12074 12100 0.0074
## BP_p.BP
## Original 0.0309
## Stepwise 0.0003
## Polynomial 0.0001
## Exhaustive 0.0009
# Visualize residual plots for all models
par(mfrow = c(2, 2))
for (name in names(models)) {
plot(fitted(models[[name]]), resid(models[[name]]),
main = paste(name, "Model Residuals"),
xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)
}
# Reset plotting parameters
par(mfrow = c(1, 1))
# Print summaries of significant variables for each model
print("Significant variables in each model (p < 0.05):")
## [1] "Significant variables in each model (p < 0.05):"
for (name in names(models)) {
cat("\n", name, "Model:\n")
coef_summary = summary(models[[name]])$coefficients
sig_vars = coef_summary[coef_summary[,4] < 0.05, ]
print(sig_vars)
}
##
## Original Model:
## Estimate Std. Error t value Pr(>|t|)
## 5.633e+01 3.150e+00 1.788e+01 1.398e-64
##
## Stepwise Model:
## Estimate Std. Error t value Pr(>|t|)
## 5.646e+01 1.229e+00 4.595e+01 3.597e-282
##
## Polynomial Model:
## Estimate Std. Error t value Pr(>|t|)
## 5.304e+01 1.212e+00 4.375e+01 4.884e-264
##
## Exhaustive Model:
## Estimate Std. Error t value Pr(>|t|)
## 5.513e+01 1.655e+00 3.331e+01 8.474e-180
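In each block above the single printed row has lost its row name, which makes it easy to miss that the only term with p < 0.05 is the intercept. This happens because subsetting a matrix down to one row returns a plain vector by default. A small sketch with contrived data (not from our dataset) shows the effect and how drop = FALSE keeps the label:

```r
# Contrived data: y does not depend on x, so only the intercept is significant
x = rep(c(-1, 1), 25)
e = rep(c(1, 1, -1, -1), length.out = 50)   # constructed to be orthogonal to x
sim = data.frame(x = x, y = 5 + e)
coefs = summary(lm(y ~ x, data = sim))$coefficients

keep = coefs[, "Pr(>|t|)"] < 0.05
dropped = coefs[keep, ]                 # one matching row collapses to a vector
kept    = coefs[keep, , drop = FALSE]   # stays a matrix, row name preserved

print(dropped)
print(kept)
```

With drop = FALSE the output would name the row "(Intercept)", making explicit that no predictor reaches the 5% level in any of the four models.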
# Print additional information about exhaustive search
cat("\nExhaustive Search Details:\n")
##
## Exhaustive Search Details:
cat("Best model size:", length(best_vars), "variables\n")
## Best model size: 3 variables
cat("Variables selected by exhaustive search:\n")
## Variables selected by exhaustive search:
print(best_vars)
## [1] "Building.Size.m.2." "Building.Age..years."
## [3] "Peak.Demand.Reduction.Indicator"
From this output, we can compare the models on RMSE, R-squared, and adjusted R-squared, check the model diagnostics, and use AIC/BIC to judge which model best balances fit and complexity.
Model Performance Comparison:
All models show very similar performance metrics:
Train RMSE: ranges from 17.53 to 17.56
Test RMSE: ranges from 19.39 to 19.41
R-squared: extremely low (all below 1%)
Adjusted R-squared: even lower (all below 0.4%)
Model Diagnostics:
All models fail the Shapiro-Wilk test (p < 0.05), indicating non-normal residuals
All models fail the Breusch-Pagan test (p < 0.05), indicating heteroscedasticity
The residual plots show similar patterns across all models:
A fairly even spread around zero
Some potential heteroscedasticity (fan-shaped pattern)
No obvious non-linear patterns
Model Complexity: The Stepwise model has both the lowest AIC (12073) and the lowest BIC (12094), which suggests it achieves the best balance of fit and complexity.
Exhaustive Search Results: Three variables are selected (Building Size, Building Age, and Peak Demand Reduction Indicator), which suggests these are potentially the most important predictors.
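For reference, the AIC and BIC values reported by R for a linear model follow directly from the log-likelihood and the number of estimated parameters k (the coefficients plus one for the error variance). The sketch below checks this on simulated data; the data and object names are hypothetical.

```r
# Simulated data (hypothetical)
set.seed(7)
n = 120
x = rnorm(n)
y = 1 + 0.4 * x + rnorm(n)
fit = lm(y ~ x)

ll = logLik(fit)
k  = attr(ll, "df")               # 2 coefficients + 1 error variance = 3

# AIC = -2 logLik + 2 k ;  BIC = -2 logLik + log(n) k
aic_manual = -2 * as.numeric(ll) + 2 * k
bic_manual = -2 * as.numeric(ll) + log(n) * k

print(c(AIC = aic_manual, BIC = bic_manual))
print(c(AIC(fit), BIC(fit)))      # built-in values for comparison
```

Since BIC - AIC = k (log(n) - 2), the gap grows with the number of parameters. That is why the ten-parameter Original model shows a gap of about 52 in the table above while the four-parameter Stepwise model shows a gap of about 21.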
step_model
##
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Building.Size.m.2. +
## Peak.Demand.Reduction.Indicator, data = train_data[-influential,
## ])
##
## Coefficients:
## (Intercept) Building.Size.m.2.
## 56.46473 -0.00181
## Peak.Demand.Reduction.Indicator
## -2.64837
Overall Conclusions: None of the models provides good predictive power (all R-squared values are below 1%). The simpler Stepwise model performs slightly better once model complexity is taken into account, and the exhaustive search likewise retains only a few variables, although in every model only the intercept is significant at the 5% level. Thus the provisional best model is the stepwise model, whose predictors are Building.Size.m.2. and Peak.Demand.Reduction.Indicator.
Given that all models perform similarly poorly, the simpler Stepwise model might be preferred for interpretability. The underlying relationships might be more complex than what linear models can capture.
This comprehensive analysis, despite its limitations, provides valuable insights for building energy management. While our models showed limited predictive power, they highlighted important factors affecting energy consumption. Future work should focus on gathering more detailed data and exploring more sophisticated modeling approaches to better understand and predict building energy consumption patterns.